Replicated Paper:
Pigmentocracy in the Americas: How is Educational Attainment Related to Skin Color? By Edward Telles and Liza Steele
Paper Overview:
This Insights report addresses the question of whether educational attainment, a key indicator of socioeconomic status, is related to skin color in Latin America and the Caribbean. Based on data from the 2010 AmericasBarometer, our analysis shows that persons with lighter skin color tend to have higher levels of schooling than those with dark skin color throughout the region, with few exceptions (2012:i)
Figure 1 in the paper is an image to aid with the understanding of a variable(colorr):
(There is nothing to replicate)
Here is the set up for Figure 2 by Telles and Steele:
In Figure 2, we show the relation between skin color and schooling for 23 countries in the 2010 AmericasBarometer.4 Our dependent variable is educational attainment, which is based on the grade level completed by the respondent.
Figure 2 graphically represents the relation between educational attainment and skin color in four regions, where the lightest persons are near 1 and the darkest near 11. We present data points only where there are at least 30 persons from the survey. Categories with fewer than 30 respondents are combined with contiguous groups (for example, 1’s are combined with 2’s in many countries, in which case the combined category is reported as a 2) (2012:2)
Here is the figure from the paper:
Here is our progress so far:
#####################################################################################################################
##### Assignment: Replication Project (Part 1) #####
##### Due Date: 3/15/2015 #####
##### Authors: Megan Blanchard and Kalyani Jayasankar #####
##### Input: clean2010data.dta,and Images in Images folder of project #####
##### Output: This code does anaylsis on the clean data file. We create 3 figures : #####
##### Figure 3: Bar Graph of Average Ed by Skin Color for each country #####
##### Appendix: OLS Models Predicting Years of Schooling in Latin American Countries, 2010 #####
##### Figure 4: Effects of Skin Color and Other Factors on Educational Attainment #####
##### Notes: #####
##### #####
#####################################################################################################################
#install.packages("stargazer")
#install.packages("gridExtra")
#install.packages("xtable")
library(stargazer)
library(foreign)
library(dplyr)
library(ggplot2)
library(broom)
library(gridExtra)
library(xtable)
setwd("~/Desktop/R working directory/RepProj")
clean.2010<-read.dta("clean2010data.dta")
####################################################################################################################
#####Creating Figure 2: Relation between Skin Color and Educational Attainment in Latin America and the Caribbean
#####facet by the four specified regions,
#####Note: Categories with fewer than 30 respondents are combined with continuous groups
#####(for example, 1’s are combined with 2’s in many countries, in which case the combined category is reported as a 2)
####################################################################################################################
#Note: for Figure 2 below, we still have 1 major problem to resolve:
#recode the values of colorr to get the groups to collapse
#(the plot printed here is the uncoded values)
#get counts for colorr by country in temporary dataset
unique(clean.2010$colorr)
test<-xtabs(formula = ~colorr + pais, data=clean.2010, exclude= "97 Colud not be classified")
test<-as.data.frame(test)
test$colorr<- as.numeric(test$colorr)
#the loop that crashes freqs, first for low end values of colorr, then for high end to flag categories to collapse
for(country in unique(test$pais)){
minInd = min(which(test$pais == country)) - 1
for (colorr in 1:6) {
if (test[colorr +minInd, "Freq"] <= 30) {
test[colorr +minInd + 1, "Freq"] <- test[colorr + minInd, "Freq"] + test[colorr + minInd + 1, "Freq"]
test[colorr +minInd, "Freq"] <- 0
}
}
}
for(country in unique(test$pais)){
minInd = min(which(test$pais == country)) - 1
for (colorr in 11:6) {
if (test[colorr +minInd, "Freq"] <= 30) {
test[colorr +minInd - 1, "Freq"] <- test[colorr + minInd, "Freq"] + test[colorr + minInd - 1, "Freq"]
test[colorr +minInd, "Freq"] <- 0
}
}
}
test
#check to make sure there are no vlaues between 1 and 29
filter(test, Freq < 30 & Freq > 0)
#Make a dataframe of what the minimum & maximum values are by country
min_max <- test %>%
group_by(pais) %>%
filter(Freq > 0) %>%
select(colorr) %>%
summarise(min = min(colorr), max = max(colorr))
min_max
#now, this is what we want to create, but in a loop for all countries that calls the vectors from min_max dataframe
test %>%
filter(pais=="Guatemala") %>%
mutate(colorr_recode= ifelse(colorr <= 3, 3, ifelse(colorr >= 8, 8, colorr)))
#then use mapvaluse (plyr) to create a recoded colorr value with correct collasped categories
#sample code:
#clean.2010 $colorr_recode <- NA
#for (country in unique(recode_df$pais)) {
#clean.2010 $colorr_recode <- mapvalues(clean.2010$colorr, from= recode_df[ , "colorr"], to= recode_df[ ,"colorr_recode"])
#}
Here is the set up for Figure 3 by Telles and Steele:
Figure 3 shows the mean levels of schooling for the residents with the lightest skin (1-3) compared to those with darkest skin (6+) in all 23 countries, ordered by the size of the average difference between the two. Figure 3 also presents confidence intervals around these means, given that there is a margin of error for these population samples, as there is in all survey samples of large populations (2012:4)
Here is the figure from the paper:
Here is our replication:
Here is our code:
####################################################################################################################
#####shows the mean levels of schooling for the residents with the lightest skin (1-3)
#####compared to those with darkest skin (6+) in all 23 countries,
#####ordered by the size of the average difference between the two.
####################################################################################################################
#prep for Figure 3
#Create a new dataframe pais_ed
pais_ed <- clean.2010 %>%
filter(!is.na(colorr), !is.na(ed), tone != "medium") %>%
group_by(pais, tone) %>%
select(pais, tone, ed) %>%
mutate(se= sd(ed)/sqrt(length(ed)),
ed_n= length(ed),
ci.lower = (mean(ed) - 1 * qt(.975, (ed_n - 1)) * se),
ci.upper = (mean(ed) + 1 * qt(.975, (ed_n - 1)) * se)) %>%
group_by(pais, tone, ci.lower, ci.upper) %>%
summarise(mean_ed= mean(ed))
#Note for reference- formulas for finding the CI:
# CI = mean(x) + c(-1,1) * qt(0.975, n-1) * se, where se = sd(x)/sqrt(n)
#Note: for Figure 3 below, we still have 3 problems to resolve:
#1)specify the demensions in the code
#2)get scale to extend past 10 on axis
#3)reorder the position or countries based on difference in mean between dark and light skin
#to change plot order of facet grid, change the order of variable levels with factor()
#ex code: levels = c("Ideal", "Very Good", "Fair", "Good", "Premium"))
#Figure 3
fig3 <- ggplot(pais_ed, aes( x = factor(tone), y = mean_ed, fill = tone))+
ylim(0, 14) +
ylab("Years of Schooling") + xlab("") +
geom_bar(stat="identity") +
geom_errorbar(aes(ymin = ci.lower, ymax = ci.upper)) +
facet_wrap(~ pais, ncol = 2) +coord_flip() +
geom_text(aes(label= round(mean_ed, 2)), size=3.5, hjust=-.7) +
scale_fill_brewer(palette="Greens") +
theme(legend.position = "none") +
scale_x_discrete(labels = c("Light Skin", "Dark Skin")) +
#scale_y_continuous(breaks= seq(0, 14, by=2)) +
ggtitle("Average Eduacational Attainment for Persons with \n \nDarkest and Lightest Skin Colors in the Americas, 2010") +
theme(plot.title = element_text(lineheight=.6, face="bold", size = 12),
strip.text.x = element_text(face = "bold", hjust= 0),
strip.background = element_rect(colour="white", fill="white"),
panel.background = element_rect(fill = 'white', colour = 'white'),
axis.text = element_text(colour = "black"),
axis.title = element_text(face= "bold"))
fig3.note <- arrangeGrob(fig3,
sub = textGrob("95% Confidence Interval (Design-Effect Based) \nSource: AmericasBarometer by LAPOP",
x = .1, hjust = 0, vjust= 0,
gp = gpar(fontface = "italic", fontsize = 10)))
Here is the set up for the Appendix and Figure 4 by Telles and Steele:
Since other factors besides color may affect years of schooling, we run a regression analysis predicting years of education by skin color, as well as class origin, age, sex, urban/rural residence and country of residence. We run the regression model only for the eight countries (Bolivia, Brazil, Colombia, Dominican Republic, Ecuador, Guatemala, Mexico and Peru) in which the class origin data are available[…] The results of the OLS regression analysis are shown in the first column of the Appendix and are summarized graphically in Figure 4. In order to compare the relative sizes of the effects, the figure presents standardised coefficients (2012:5)
The second OLS regression model reported in the Appendix includes interactions between skin color and an indicator variable for each country, using Brazil as the country of reference (2012:6).
Here is the Appendix Table from the paper:
Here is our replication:
Here is the Figure 4 from the paper:
Here is our replication:
Here is our code:
####################################################################################################################
#####Creating Appendix Table: OLS Models Predicting Years of Schooling in Select Latin American Countries, 2010
#####And Figure 4. Effects of Skin Color and Other Factors on Educational Attainment
####################################################################################################################
#Note for the Appendix Table and Figure 4: We have 3 problems to resolve:
#1)finding how parental occupation is coded so it can be added to the model (currently excluded from anaylsis)
#2)finding out scaling for non numerics in the regression
#3)how to apply weights to pais
#run regression models 1 and 2
# data prep
clean.2010$colorr[clean.2010$colorr=="97 Colud not be classified"] <- NA
clean.2010$colorr <- factor(clean.2010$colorr)
clean.2010$colorr <- as.numeric(clean.2010$colorr)
#select correct countries for analysis in new data frame, make categorical vars numeric
eight_pais <- clean.2010 %>%
filter(pais %in% c("Brazil", "Mexico", "Guatemala", "Colombia", "Ecuador", "Bolivia", "Peru", "Dominican Republic")) %>%
mutate(Female = ifelse(q1 == "Female", 1, ifelse(q1 == "Male", 0, NA)),
Urban = ifelse(ur == "Urban", 1, ifelse(ur == "Rural", 0, NA)))
#make pais a factor
eight_pais$pais <- factor(eight_pais$pais)
#make Brazil the reference category
eight_pais$pais <- relevel(eight_pais$pais, ref = "Brazil")
#check
levels(eight_pais$pais)
#model 1
?lm
reg1 <- lm(scale(ed)~ scale(colorr) + scale(parent_occ) + scale(Female) + scale(q2) + scale(Urban) + pais, data=eight_pais)
model1<- summary(reg1)
model1_Rsquared <- model1$r.squared
model1_fstat <- model1$fstatistic["value"]
#check to make sure the scaling works to standardize the coefficients- do by hand, then compare using "scale"
eight_pais$ed.std<- (eight_pais$ed - mean(eight_pais$ed, na.rm=TRUE))/ sd(eight_pais$ed, na.rm=TRUE)
test_model <- lm(ed.std ~ scale(colorr) + scale(parent_occ) + q1 + scale(q2) + ur + pais, data=eight_pais)
#(yes! it works)
#model 2
reg2<- lm(scale(ed)~ scale(colorr) + scale(parent_occ) + scale(Female) + scale(q2) + scale(Urban) + pais + pais:scale(colorr), data=eight_pais)
model2<- summary(reg2)
model2_R_squared <- model2$r.squared
model2_Fstat<- model2$fstatistic["value"]
#prep for table
labels = c("Skin color", "Parental Occupation", "Female", "Age", "Urban", "Mexico", "Guatemala", "Colombia", "Ecuador", "Bolivia", "Peru", "Dominican Republic",
"Interaction: Mexico X skin color", "Interaction: Guatemala X skin color", "Interaction: Colombia X skin color", "Interaction: Ecuador X skin color", "Interaction: Bolivia X skin color", "Interaction: Peru X skin color", "Interaction: Dominican Republic X skin color")
#Appendix Table
?stargazer
stargazer(reg1, reg2,
type= "html",
digits= 3,
covariate.labels = labels,
column.labels = c("Model 1", "Model 2"),
no.space = TRUE,
dep.var.caption = "",
dep.var.labels = "",
title = "Ordinary Least Squares Models Predicting Years of Schooling in Select Latin American Countries, 2010",
out="Appendix_replication.html")
# Figure 4
#prep for graph
#rerunning regreesion for correct order in graph
model1.graph <- lm(scale(ed) ~ scale(colorr) + + scale(parent_occ) + scale(Urban) + scale(q2) + scale(Female) + pais, data=eight_pais)
model1.graph$coefficients
#creating data frame of coefficients and CI's
graph.coef <- summary(model1.graph)$coefficients[2:6, 1]
model1.CI <- confint(model1.graph)
model1.CIlower <- model1.CI[2:6 , 1]
model1.CIupper <- model1.CI[2:6 , 2]
labels = c("Skin Color","Parental Occupation", "Urban", "Age", "Female")
frame <- data.frame(variable = labels,
coefficient = graph.coef,
ci.lower = model1.CIlower,
ci.upper = model1.CIupper)
#reorder variables
frame$variable.order <- factor(frame$variable, levels = c("Female", "Age", "Urban", "Parental Occupation", "Skin Color"))
#Create Figure 4
fig4<- ggplot(frame, aes(y = coefficient, x = variable.order)) + geom_point() +
geom_pointrange(aes(ymin = ci.lower, ymax = ci.upper)) +
geom_hline(yintercept = 0, color="dark green", size= 1) +
coord_flip() +
ylab("95 Percent C.I. (Design -Effects Based)") + xlab("") +
theme_classic() +
geom_text(aes(x=5, y=.2, label= "R-Squared:0.325 \nF=577.823 \nN=14,399 "), size= 4.5) +
ggtitle("Effects of Skin Color and Other Factors on\n\n Educational Attainment in Select Latin American \n\nCountries") +
theme(plot.title = element_text(lineheight=.6, face="bold", size = 12),
axis.text = element_text(colour = "black", face= "bold"),
axis.title = element_text(face= "bold"))
fig4.note <- arrangeGrob(fig4,
sub = textGrob("Source: Americas Barometer by LAPOP",
x = .1, hjust = 0, vjust= 0,
gp = gpar(fontface = "italic", fontsize = 10)))